15 research outputs found
DSPatch: Dual Spatial Pattern Prefetcher
High main memory latency continues to limit performance of modern
high-performance out-of-order cores. While DRAM latency has remained nearly the
same over many generations, DRAM bandwidth has grown significantly due to
higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked
memory packaging (HBM). Current state-of-the-art prefetchers do not do well in
extracting higher performance when higher DRAM bandwidth is available.
Prefetchers need the ability to dynamically adapt to available bandwidth,
boosting prefetch count and prefetch coverage when headroom exists and
throttling down to achieve high accuracy when the bandwidth utilization is
close to peak. To this end, we present the Dual Spatial Pattern Prefetcher
(DSPatch) that can be used as a standalone prefetcher or as a lightweight
adjunct spatial prefetcher to the state-of-the-art delta-based Signature
Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of
modulated spatial bit-patterns. The key idea is to: (1) represent program
accesses on a physical page as a bit-pattern anchored to the first "trigger"
access, (2) learn two spatial access bit-patterns: one biased towards coverage
and another biased towards accuracy, and (3) select one bit-pattern at run-time
based on the DRAM bandwidth utilization to generate prefetches. Across a
diverse set of workloads, using only 3.6KB of storage, DSPatch improves
performance over an aggressive baseline with a PC-based stride prefetcher at
the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in
memory-intensive workloads and up to 26%). Moreover, the performance of
DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to
10% when DRAM bandwidth is doubled.Comment: This work is to appear in MICRO 201
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Profile hidden Markov models (pHMMs) are widely employed in various
bioinformatics applications to identify similarities between biological
sequences, such as DNA or protein sequences. In pHMMs, sequences are
represented as graph structures. These probabilities are subsequently used to
compute the similarity score between a sequence and a pHMM graph. The
Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these
probabilities to optimize and compute similarity scores. However, the
Baum-Welch algorithm is computationally intensive, and existing solutions offer
either software-only or hardware-only approaches with fixed pHMM designs. We
identify an urgent need for a flexible, high-performance, and energy-efficient
HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm
for pHMMs.
We introduce ApHMM, the first flexible acceleration framework designed to
significantly reduce both computational and energy overheads associated with
the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in
the Baum-Welch algorithm by 1) designing flexible hardware to accommodate
various pHMM designs, 2) exploiting predictable data dependency patterns
through on-chip memory with memoization techniques, 3) rapidly filtering out
negligible computations using a hardware-based filter, and 4) minimizing
redundant computations.
ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and
27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch
algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations
in three key bioinformatics applications: 1) error correction, 2) protein
family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x -
1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency
by 64.24x - 115.46x, 1.75x, 1.96x.Comment: Accepted to ACM TAC
Early Prediction of DNN Activation Using Hierarchical Computations
Deep Neural Networks (DNNs) have set state-of-the-art performance numbers in diverse fields of electronics (computer vision, voice recognition), biology, bioinformatics, etc. However, the process of learning (training) from the data and application of the learnt information (inference) process requires huge computational resources. Approximate computing is a common method to reduce computation cost, but it introduces loss in task accuracy, which limits their application. Using an inherent property of Rectified Linear Unit (ReLU), a popular activation function, we propose a mathematical model to perform MAC operation using reduced precision for predicting negative values early. We also propose a method to perform hierarchical computation to achieve the same results as IEEE754 full precision compute. Applying this method on ResNet50 and VGG16 shows that up to 80% of ReLU zeros (which is 50% of all ReLU outputs) can be predicted and detected early by using just 3 out of 23 mantissa bits. This method is equally applicable to other floating-point representations